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Abstract 

Anomaly detection has important applications in bio- 
surveilance and environmental monitoring. When compar- 
ing measured data to data drawn from a baseline distribu- 
tion, merely, finding clusters in the measured data may not 
actually represent true anomalies. These clusters may likely 
be the clusters of the baseline distribution. Hence, a discrep- 
ancy function is often used to examine how different mea- 
sured data is to baseline data within a region. An anomalous 
region is thus defined to be one with high discrepancy. 

In this paper, we present algorithms for maximizing sta- 
tistical discrepancy functions over the space of axis-parallel 
rectangles. We give provable approximation guarantees, 
both additive and relative, and our methods apply to any 
convex discrepancy function. Our algorithms work by con- 
necting statistical discrepancy to combinatorial discrepancy; 
roughly speaking, we show that in order to maximize a con- 
vex discrepancy function over a class of shapes, one needs 
only maximize a linear discrepancy function over the same 
set of shapes. 

We derive general discrepancy functions for data gen- 
erated from a one- parameter exponential family. This 
generalizes the widely-used KuUdorff scan statistic for data 
from a Poisson distribution. We present an algorithm run- 
ning in 0(in^ log^ n) that computes the maximum discrep- 
ancy rectangle to within additive error e, for the KuUdorff 
scan statistic. Similar results hold for relative error and 
for discrepancy functions for data coming from Gaussian, 
Bernoulli, and gamma distributions. Prior to our work, the 
best known algorithms were exact and ran in time 0(n*). 

1 Introduction 

Outlier detection (or "bump hunting" is a common 
problem in data mining. Unlike in robust clustering 
settings, where the goal is to detect outliers in order to 
remove them, outliers are viewed as anomalous events 
to be studied further. In the area of biosurveillance for 
example, an outlier would consist of an area that had an 
unusually high disease rate (disease occurence per unit 
population) of a particular ailment. In environmental 
monitoring scenarios, one might monitor the rainfall 
over an area and wish to determine whether any region 
had unusually high rainfall in a year, or over the past 
few years. 
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A formal statistical treatment of these problems al- 
lows us to abstract them into a common framework. 
Assume that data (disease rates, rainfall measurements, 
temperature) is generated by some stochastic spatial 
process. Points in space are either fixed or assumed to 
be generated from some spatial point process and mea- 
surements on points are assumed to be statistically inde- 
pendent and follow a distribution from a one-parameter 
exponential family. Also, let 6(-) be some baseline mea- 
sure defined on the plane. For instance, b{-) can be 
the counting measure (counts the number of points in 
a region), volume measure (measures volume of a re- 
gion) , weighted counting measure (adds up non-negative 
weights attached to points in a region). In biosurveil- 
lance, the counting measure (gives the region popula- 
tion) is often used to discover areas with elevated dis- 
ease risk. Weighted counting measures which aggregate 
weights assigned to points based on some attribute (e.g. 
race of an individual) have also been used (see [T^ for 
an example). Let p be a set of points generating a set 
of measurements m{p). Given a measure of discrepancy 
/ that takes as input the functions m(-) and 6(-) and 
a collection of regions S, the problem of statistical dis- 
crepancy can be defined as: 

Find the region S E S for which / invoked on the 
measurements for points in S is maximized. 

Statistical discrepancy functions arise by asking the 
following question: "How likely is it that the observed 
points in S come from a distribution that is different 
than the distribution of points in S'^T' . The function 
/ is derived using a likelihood ratio test which has high 
statistical power to detect the actual location of clusters, 
but is computationally difficult to deal with. As a 
consequence, most algorithmic work on this problem has 
focused either on fast heuristics that do not search the 
entire space of shapes, or on conservative heuristics that 
guarantee finding the maximum discrepancy region and 
will often (though not always) run in time less than the 
worst-case bound of |iS| times the function evaluation 
cost. 

Apart from identifying the region for which / is 
maximized, it is customary to evaluate the likelihood 
of the identified cluster being generated by chance, i.e., 
compute the probability (called p-value) of maximum 



discrepancy to exceed the observed maximum discrep- 
ancy under the null hypothesis of no clustering effect. 
A small p-value (e.g. < .05) will mean the identified 
cluster is statistically significant. Since the distribution 
of / is usually not analytically tractable, randomiza- 
tion tests f|12l [7]') which involve multiple instances of 
the maximum discrepancy computation are run for data 
generated from the null model. Thus, the computation 
of statistical discrepancy is the main algorithmic bot- 
tleneck and is the problem we focus on in this paper. 

1.1 Our Contributions In this paper, we present 
algorithms with non-trivial worst-case running time 
bounds for approximating a variety of statistical dis- 
crepancy functions. Our main result is a structural 
theorem that reduces the problem of maximizing any 
convex discrepancy function over a class of shapes to 
maximizing a simple linear discrepancy function over 
the same class of shapes. 

The power of this theorem comes from the fact 
that there are known algorithms for maximizing special 
kinds of linear discrepancy functions, when the class 
of shapes consists of axis-parallel rectangles. Given 
two sets of red and blue points in the plane, the 
combinatorial discrepancy of a region is the absolute 
difference between the number of red and blue points 
in it. Combinatorial discrepancy is very valuable when 
derandomizing geometric algorithms; it also appears 
in machine learning as the relevant function for the 
minimum disagreement problem, where red and blue 
points are thought of as good and bad examples for a 
classifier, and the regions are hypotheses. This problem 
(and a more general variant of it) was considered by 
Dobkin, Maass and Gunopoulos in 1995 [1], where they 
showed that combinatorial discrepancy for axis-parallel 
rectangles in the plane could be maximized exactly in 
time 0{n^ logn), far better than the 0{n*) bound that 
a brute-force search would entail. 

We show that the Dobkin et. al. algorithm can 
be extended fairly easily to work with general linear 
discrepancy functions. This result, combined with our 
general theorem, allows us to approximate any convex 
discrepancy function over the class of axis-parallel rect- 
angles. We summarize our results in Table ^ as an 
example, we present an additive approximation algo- 
rithm for the KuUdorff scan statistic that runs in time 
0(i-n^ log^ n), which compares favorably to the (trivial) 
0(n*) running time to compute an exact solution. 

Essentially, the reduction we use allows us to decou- 
ple the measure of discrepancy (which can be complex) 
from the shape class it is maximized over. Using our 
approach, if you wanted to maximize a general discrep- 
ancy function over a general shape class, you need only 



consider combinatorial discrepancy over this class. As 
a demonstration of the generality of our method, we 
also present algorithms for approximately maximizing 
discrepancy measures that derive from different under- 
lying distributions. In fact, we provide general expres- 
sions for the one-parameter exponential family of dis- 
tributions which includes Poisson, Bernoulli, Gaussian 
and Gamma distributions. For the Gaussian distribu- 
tion, the measure of discrepancy we use is novel, to the 
best of our knowledge. It is derived from maximum like- 
lihood considerations, has a natural interpretation as a 
distance, and may be of independent interest. 
Another notion of outlier detection incorporates 
a time dimension. In prospective outlier detection, 
we would like to detect the maximum discrepancy 
region over all time intervals starting from the present 
and going backwards in time. We show that linear 
discrepancy can be maximized over such time intervals 
and, as a consequence of our reduction, show that 
any convex discrepancy function can be approximately 
maximized. 

2 Related Work 

Detecting clustering effects in spatial data is a well- 
studied problem in statistics^. Much of the early 
focus has been on devising efficient statistical tests to 
detect presence of clustering at a global level without 
emphasis on identifying the actual clusters (see |31 
Chapter 8]). The spatial scan statistic, introduced by 
KuUdorff TT provides an elegant solution for detection 
and evaluation of spatial clusters. The technique has 
found wide applicability in areas like public health, 
biosurveillance, environmental monitoring etc. For 
interesting applications and detailed description of scan 
statistics, we refer the reader to [01 ^j. Generalization 
of the spatial scan statistic to a space-time scan statistic 
for the purpose of prospective surveillance has been 
proposed by KuUdorff and Iyengar jHj suggested 
the use of "expanding-in-time" regions to detect space- 
time clusters. We note that the algorithms described 
by KuUdorff are heuristics: they do not guarantee any 
bound on the quality of the solution, and do not traverse 
the entire space of regions. The regions he considers 
are circular, and cylindrical in the case of prospective 
surveillance. 

Dobkin and Eppstein 0] were the first to study effi- 
cient algorithms to compute maximum discrepancy over 
a range space. Their algorithms compute discrepancy in 



^ It goes without saying that there is a huge literature on 
clustering spatial data. Since our focus is primarily on statistically 
sound measures, a survey of general clustering methods is beyond 
the scope of this paper. 
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Table 1: Our results. For prospective discrepancy, multiply all bounds by n, and for higher dimensions, multiply 
by n2d-4^ 



a region i? as a difference between the fraction of points 
in R and the fraction of the total area represented by R. 
This measure stems from evaluating fundamental oper- 
ations for computer graphics. Their ranges were half 
spaces and axis-oriented orthants centered at the ori- 
gin, limited to the unit cube, and their results extended 
to d-dimensional spaces. Subsequently Dobkin, Gunop- 
ulous, and Maass |3| developed algorithms for comput- 
ing maximum bichromatic discrepancy over axis-parallel 
rectangular regions, where the bichromatic discrepancy 
of a region is the difference between the number of red 
points and the number of blue points in the region. 
This solves the minimum disagreement problem from 
machine learning, where an algorithm finds the region 
with the most good points and the fewest bad points, a 
key subroutine in agnostic PAC learning. 

Recently, Neill and Moore have developed a series of 
algorithms to maximize discrepancy for measures such 
as the KuUdorff spatial scan statistic. Their approach 
works for axis parallel squares and rectangles |16| . 
Their algorithms are conservative, in that they always 
find the region of maximum discrepancy. The worst- 
case running time of their algorithms is O(n^) for 
rectangles and O(n^) for fixed-size squares since the 
algorithms enumerate over all valid regions. However, 
they use efficient pruning heuristics that allow for 
significant speedup over the worst case on most data 
sets. An alternative approach by Friedman and Fisher 
in] greedily computes a high discrepancy rectangle, 
but has no guarantees as to how it compares to the 
optimal. Their approach is quite general, and works in 
arbitrary dimensional spaces, but is not conservative: 
many regions will remain unexplored. 

A one-dimensional version of this problem has been 
studied in bioinformatics The range space is now 
the set of all intervals, a problem with much simpler 
geometry. In this setting, a relative e-approximation 
can be found in O(p-n) time. 

A related problem that has a similar flavor is the 
so-called Heavy Hitters problem (2m. In this problem, 
one is given a multiset of elements from a universe, and 
the goal is to find elements whose frequencies in the 
multiset are unusually high (i.e much more than the 
average). In a certain sense, the heavy hitter problem 



fits in our framework if we think of the baseline data 
as the uniform distribution, and the counts as the 
measurements. However, there is no notion of ranges^ 
and the heavy hitter problem itself is interesting in a 
streaming setting, where memory is limited; if linear 
memory is permitted, the problem is trivial to solve, in 
contrast to the problems we consider. 

3 Preliminaries 

Let P be a set of n points in the plane. Measurements 
and baseline measures over P will be represented by 
two functions, m : P — > R and 6 : P ^ R. TZ denotes a 
range space over P. A discrepancy function is defined 
as d : (m, 6, R) R, for P e 7^. 

Let ruR = 'Epf.Rm{p)/M,bii = T.paR^ip) / B , 
where M = T.peu'^iP)^ ^ ^ J2peuKp), and U is 
some box enclosing all of P. We will assume that d 
can be written as a convex function of mR,bR. All 
the discrepancy functions that we consider in this 
paper satisfy this condition; most discrepancy functions 
considered prior to this work are convex as well. We 
can write (i(m, 6, R) as a function d' : [0, 1]^ R, where 
c?(m, 6, R) = d'{mR, bn). We will use d to refer to either 
function where the context is clear. 

Linear discrepancy functions are a special class of 
discrepancy functions where d = a ■ + (3 ■ b^ + 
7. It is easy to see that combinatorial (bichromatic) 
discrepancy, the difference between the number of red 
and blue points in a region, is a special case of a linear 
discrepancy function. 

The main problem we study in this paper is: 

Problem 3.1. (Maximizing Discrepancy) Given a 
point set P with measurements m, baseline measure b, 
a range space TZ, and a convex discrepancy function d, 
find the range R gTZ that maximizes d. 

An equivalent formulation, replacing the range R 
by the point r = {m,R, bn) is: 

Problem 3.2. Maximize convex discrepancy function 
d over all points r = {mji, bn), R lE TZ. 



■^Hierarchical heavy hitters provide the beginnings of such a 
notion. 



Assume that points now arrive with a timestamp 
t(-), along with the measurement m(-) and basehne 
In prospective discrepancy problems, the goal is 
to maximize discrepancy over a region in space and 
time defined as i? x [i, tnow], where i? is a spatial range. 
In other words, the region includes all points with a 
timestamp between the present time and some time t in 
the past. Such regions are interesting when attempting 
to detect recent anomalous events. 

Problem 3.3. (Prospective discrepancy) Given 
a point set P with measurements m, baseline measure 
b, timestamps t, a range space TZ, and a convex discrep- 
ancy function d, find the range T — (i?, [i*, cx)]), i? G TZ 
that maximizes d. 

3.1 Boundary Conditions As we shall see in later 
sections, the discrepancy functions we consider are ex- 
pressed as log- likelihood ratios. As a consequence, they 
tend to oo when either argument tends to zero (while 
the other remains fixed). Another way of looking at 
this issue is that regions with very low support often 
correspond to overfitting and thus are not interesting. 
Therefore, this problem is typically addressed by re- 
quiring a minimum level of support in each argument. 
Specifically, we will only consider maximizing discrep- 
ancy over shapes R such that mji > C/n,bR > C/n, 
for some constant C. In mapping shapes to points as 
described above, this means that the maximization is 
restricted to points in the square Sn = [C/n, 1 — C/n]^. 
For technical reasons, we will also assume that for all p, 
m{p),b{p) = 6(1). Intuitively this reflects the fact that 
measurement values are independent of the number of 
observations made. 

4 A Convex Approximation Theorem 

We start with a general approximation theorem for 
maximizing a convex discrepancy function d. Let 
£{x,y) — aix + a2y + 03 denote a linear function in 
X and y. Define an e- approximate family of d to be 
a collection of linear functions £i,£2, ■ ■ ■ ,f^t such that 
l^{x,y) — maxi<t £i{x , y) , the upper envelope of the £i, 
has the property that l^{x,y) < d{x,y) < l^{x,y) + e 
Define a relative e-approximate family of d to be a 
collection of linear functions ii,£2, ■ ■ ■ ,£t such that 
/^(x,y) <d(x,y) < {l + e)l^ix,y) 

Lemma 4.1. Let fi,^2,---,^t be an e-approximate 
family of a convex discrepancy function 
d : [0,1]^ — * M. Consider any point set 
C C [0, 1]2. Let «,y*) = argmaxp6c^»(Px,Py), 
and let {x*,y*) — aigmaxx* ,y' £iix* ,y*). Let 

d* = SUPpgcrf(Pa:,Pa); dinf = iuf qg [o,i]2 d(q^ , ) 



and let — max{l^ {x* ,y*), dinf) ■ Then 

m < d* < m + e 

If ^2, ■ ■ ■ T £t is a relative e-approximate family, 
then 

TO < < (1 + e)m 

Proof Sketch. By construction, each point 
{x*,y*,li{x*,y*)) lies on the upper envelope l^ . 
The upper envelope is convex, and lower bounds d(-), 
and therefore in each patch of l^ corresponding to 
a particular £i, the maximizing point {x*,y*) also 
maximizes d{x,y) in the corresponding patch. This 
is only false for the patch of f that supports the 
minimum of d{x,y), where the term involving djnf is 
needed. This corresponds to adding a single extra 
plane tangent to d{-) at its minimumm, which is unique 
by virtue ol d{-) being convex. 

Lemma 4.2. Let f : [0, 1]^ — > M &e a convex smooth 
function. Let / : [0, 1]^ — > R &e the linear approximation 
to f represented by the hyperplane tangent to f at 
p e [0,1]2. Then /(p) < /(p), and /(p) - /(q) < 
||p— q|pA*, where A* is the maximum value of the largest 
eigenvalue of H{f), maximized along the line joining p 
and q. 

Proof /(q) = /(p) + (q - p)^V/(p). The inequality 
/(p) < /(p) follows from the convexity of /. By 
Taylor's theorem for multivariate functions, the error 
/(P) - /(q) = (q - P)^ff(/)(p*)(q - P), where H{f) 
is the Hessian of /, and p* is some point on the line 
joining p and q. 

Writing q — p as ||q — p||x, where x is a unit vector, 
we see that the error is maximized when the expression 
x^iJ(/)x is maximized, which happens when x is the 
eigenvector corresponding to the largest eigenvalue A* 
ofi7(/). 

Let A* = suppgs^^ Ainax(i?(/)(p))- Let ep(q) = 
||p-qfA*,6«(q) = ||p-qfAV(p). 

Lemma 4.3. Let C C Sn be a set of t points such that 
for all q G S'„, miupgc ep(q)(?~esp. ep(q)) < £• Then 
the t tangent planes at the points /(p),p G C form an 
e-approximate (resp. relative e-approximate) family for 
/• 

Proof. Let C — {pi, . . . ,Pt}. Let li denote the tangent 
plane at /(pi). For all i, ^i(q) < /(q) by Lemma [4.21 
Let j = argmiui ep. (q). Then ^j(q) = maxiZi(q) < e. 
A similar argument goes through for Cp. (q) 

The above lemmas indicate that in order to con- 
struct an e-approximate family for the function /, we 



need to sample an appropriate set of points from S'„. 
A simple area-packing bound, using the result from 
Lemma 14.31 indicates that we would need 0(A*/e) 
points (and thus that many planes). However, A* is 
a function of n. A stratified grid decomposition can 
exploit this dependence to obtain an improved bound. 

Theorem 4.1. Let f : [0,1]^ R be a convex smooth 
function, and fix e > 0. Let X{n) — A*(S'„). Let 
F{n, e) he the size of an e-approximate family for f , and 
let F^{n, e) denote the size of a relative e-approximate 
family. Let X{n) — 0{n'^). Then, 



Proof. The proof is by induction. The statement is true 
for j = I from Ea. (|4.1|l . Assume it is true upto j. Then 
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'0(l/e) d = 

O(Mogilogn) 0<d<l 

O(ilogn) d^l 
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Setting j = fc in Claim 14.11 yields the expression 
logn = (1 + Y!l=^d-')h - (Etirf^O'o- Substitut- 
ing in the value of l\ from Ea. (|4.2|l . logn 



X]i=^i *)logno = l/alogriQ- The number of points 
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Let X'{n) = A(n)//„iax(rt), where /max(?^) denotes 
maxpgs,^ /(p). Then F^{n, e) has size chosen from the needed is F{n, e) 
above cases according to X'{n). 
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Proof. The relation between F^{n,e) and F{n,e) fol- 
lows trivially from the relationship between Cp (q) and 

ep(q)- 

If X{n) is 0(1) , then A* can be upper bounded by 
a constant, resulting in an e-approximate family of size 
0(l/e). The more challenging case is when A* is an 
increasing function of n. 

Suppose A* — 0{n'^) in the region Sn- Consider 
the following adaptive gridding strategy. Fix numbers 
no,ni,...nfc, nu+i = n. Let = Sna = [l/"-o, 1 - 
1/no]^. Let Ai = — Uj<iAi. Thus, Aq is a square of 
side 1 — 2/no, and each Ai is an annulus lying between 
Sm+i and Sm- Aq has area 0(1) and each Ai, i > has 
area 0(l/ni_i). In each region Ai, X*{Ai) = 0{nf). 

How many points do we need to allocate to Aq? 
A simple area bound based on Lemma 14.31 shows that 
we need A*(Ao)/e points, which is 0{nQ/e). In each 
region Ai, a similar area bound yields a value of 
0(nf /erii-i). Thus the total number of points needed 
to construct the e-approximate family is N(d, k) — 

"o/e + Eo<»<fe+i"fA"»-i- 

Balancing this expression by setting all terms equal, 

and setting li — log , we obtain the recurrence 



How large is dal Consider the case when d> 1: 
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Setting k = 17(log^ logn), F{n,e) = 
0(in''~-'^ log^ logn). For example, F{n,e) = 
0(irt log log n) when d — 2. Similarly, set- 
ting k — n{\ogi/^logn) when < d < 1 yields 
F(n,e)=0(Mogi/,logn). 

When d = I, = kT2- Setting k = 

il(logn), we get F{n, e) — logn). 
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Claim 4.1. Ik+i = logn = (1 + ELi c^^O'fc-j+i 



5 Algorithms for Combinatorial Discrepancy 

Lemma 5.1. (0) Combinatorial discrepancy for a set 
of red and blue points in the plane can be computed in 
time 0{n^ logn). 

Proof Sketch. [See [H] for details] A discrepancy max- 
imizing rectangle has minimal and maximal points in 
the X and y dimensions. These four points fully de- 
scribe the rectangle. By specifying the minimizing and 
maximizing y coordinates, the problem is reduced to a 
one-dimensional problem of all points within the slab 
this defines. By maintaining the maximal discrepancy 
interval in the one-dimensional problem under point in- 
sertion, only 0{n^) insertions are necessary to check the 
maximum discrepancy interval over all possible slabs, 
and thus over all possible rectangles. 

The one-dimensional problem is solved by build- 
ing a binary tree of intervals. A node corresponding 
to interval I stores the subinterval i of maximal dis- 
crepancy, the interval I of maximal discrepancy that 



includes the left boundary of /, and the interval r of 
maximal discrepancy that includes the right boundary 
of /. Two adjacent nodes, heft ■ {iieftjieft,rieft) and 
Iright ■ {■irightjright,r right), Can be merged to form 
a single nodes / : {i,l,r) in constant time, i is as- 
signed the interval with the maximum discrepancy out 
of the set {iieft, iright, rieft U rieft}. I is assigned the 
interval with the maximum discrepancy out of the set 
{heft, heft U Iright}, and r is assigned symmetrically to 
/. The entire interval, [0, 1] = / : {i,l,r), is at the root 
of the tree, and the interval which maximizes the dis- 
crepancy is i. Adding a point requires merging O(logn) 
intervals if the tree is balanced, and the tree can be 
kept balanced through rotations which only require a 
constant number of merge operations each. 

Theorem 5.1. Let TV be the set of all rectangles such 
that X^pei? '^(p) J2p£R Hp) '^''6 greater than the 

constant C. Then any linear discrepancy function of 
the form Oi ^ m(p) +a2j2^ip) + ^^3 can be maximized 
over this set in 0{n^ ^ogn) time. 

Proof. Because 03 is constant for all intervals, it can 
be ignored. Thus the linear function has the form of 
d{m, b, R) = J2peR xip)- The algorithm of |S] relies only 
on the fact that the discrepancy function is additive, 
and hence can be extended to the above discrepancy 
function by only modifying the intervals and merge 
operation in the one-dimensional case. 

Define 2' to be the set of all intervals such that 
X]pe/ '^(p) ^ ^ and J2pei Hp) ^ C'. For each interval 
/ : {i,l,r), i must be in T' and I and r must represent 
sets of intervals h . . .Ik and ri . . . rh, respectively. Ik 
(resp. rh) is the interval in X' which contains the 
left (resp. right) boundary that has the maximum 
discrepancy, h (resp. ri) is the interval which contains 
the left (resp. right) boundary that has the maximum 
discrepancy. For all i U includes the left boundary 
and \li\ < \lj\ for all i < j. Also J2pei ''^(p) ^ 
or J2peii Hp) < C' for all i < k. Finally, the set / 
must contain all local maximum; if h were to gain on 
lose one point, the discrepancy would decrease. The 
restrictions are the same for all , expect these intervals 
must contain the right boundary. 

The local optimality restriction ensures 
'EpehMp) < Epei,+i"^(p) and 'ZpehHp) < 
Sp6i +1 ^(-P)- If either measure (X]'^(-P) or J2Hp)) 
increases then the other must also increase or this would 
violate the local optimality condition. An increase of 
just a measure that increases discrepancy will cause the 
previous interval not to be optimal and an increase in 
just a measure that causes the discrepancy to decrease 
will cause the latter interval not to be optimal. Thus 
k and h are constants bounded by the number of p 



required for ™(p) — ^ and '^b{p) > C. Thus each 
interval in the tree structure stores a constant amount 
of information. 

A merge between two adjacent intervals heft ■ 

{ileft,heft,rieft) and Iright '■ {iright, hight, fright) alsO 

can be done in constant time. Computing the maximum 
discrepancy interval in T' can be done by checking iieft 
and iright versus all pairs from Iright U He/t such that 
^ m(p) > C and YIHp) ^ ^- There are a constant 
number of these. By the local optimality restriction, the 
optimal interval in I' spanning the boundary must have 
one endpoint in each set. Updating I and r can be done 
by just pruning from he ft and heft U rieft according to 
the restrictions for and similarly for r. These remain 
a constant size after the pruning. Because a merge can 
be done in constant time, a point can be added to the 
balanced tree in O(logn) time. Hence, the maximum 
discrepancy rectangle in TV for any linear discrepancy 
function can be found in 0{n^ logn) time. 

A similar argument applies if we consider prospec- 
tive discrepancy, or higher dimensions. We omit details. 

Lemma 5.2. A linear discrepancy function can be maxi- 
mized over prospective rectangles in 0{n^ logn.) time, or 
it can be maximized over axis-parallel hyper-rectangles 
in d-dimensions in time 0{n^'^~'^\ogn). 

6 One-parameter Exponential Families 

Having developed general algorithms for dealing with 
convex discrepancy functions, we now present a general 
expression for a likelihood-based discrepancy measure 
for the one-parameter exponential family. Many com- 
mon distributions like the Poisson, Bernoulli, Gaussian 
and gamma distributions are members of this family. 
Subsequently we will derive specific expressions for the 
above mentioned distribution families. 

Definition 6.1. (One-parameter exp. family) 
The distribution of a random variable y belongs 
to a one-parameter exponential family (denoted by 
y ~ lEXP{r], (jjjT, Be, a) if it has probability density 
given by 

f{y; 77) = C{y, cb)exp{{vT{y) - Be{v))/am 

where T(-) is some measurable function, a{(j)) is a 
function of some known scale parameter (/>{> 0), r] is 
an unknown parameter (called the natural parameter), 
and Be{-) is a strictly convex function. The support 
{y • fiy,v) > 0} i^ independent ofrj. 

It can be shown that E,^{T{Y)) — B^{r]) and 
Var,,(r(r)) = a{(j))B'^{T]). In general, a(0) oc <j). 



Let y = {ni : i G R} denote a set of \R\ 
variables that are independently distributed with yi ^ 
1EXP(77, (j)i,T, b, a), {i e R). The joint distribution of y 
is given by 

where 1/0* = I]iefl'(l/"(</'»))> = <t>* / , and 

Given data y, the likelihood of parameter ry is the 
probability of seeing y if drawn from a distrbution with 
parameter 77. This function is commonly expressed in 
terms of its logarithm, the log-likelihood l{r];y), which 
is given by (ignoring constants that do not depend on 
V) 

(6.3) l{v;y)^{vT*{y)-BM)/r 
and depends on data only through T* and 0* . 

Theorem 6.1. Let y — (jji : i ^ R) be independently 
distributed with yi ^ lEXP{ri,(f)i,T,b,a), {i G R). 
Then, the maximum likelihood estimate (MLE) of 77 is 
f] = ge(T*{y)), where ge = {B^)~^. The maximized 
log-likelihood (ignoring additive constants) is ^(?);y) — 
(T*(y)5e(T*(y))-i?e(.ge(r*(y))))/0*. 

Proof. The MLE is obtained by maximizing H6.3|l as 
a function of 77. Since Be is strictly convex, B^ is 
strictly monotone and hence invertible. Thus, the 
MLE obtained as a solution of l{ri;y) = is ry = 
(B^)-i(T*(y)) = 5e(r*(y)). The second part is ob- 
tained by substituting fj in H6.3|l . 

The likelihood ratio test for outlier detection is 
based on the following premise. Assume that data is 
drawn from a one-parameter exponential family. For 
a given region i?i and its complement i?2, let rjfj-^ and 
r/flj be the MLE parameters for the data in the regions. 
Consider the two hypothesis Hq : rjj^^ — rjjf^ and 
Hi : ?7_Ri 7^ 77^3. The test then measures the ratio of 
the likelihood of Hi versus the likelihood of Hq. The 
resulting quantity is a measure of the strength of Hi; 
the larger this number is, the more likely it is that Hi 
is true and that the region represents a true outlier. 
The likelihood ratio test is individually the test with 
most statistical power to detect the region of maximum 
discrepancy and hence is optimal for the problems we 
consider. A proof of this fact for Poisson distributions is 
provided by KuUdorff 13^ and extends to lEXP without 
modification. 

Theorem 6.2. Let yR. ~ (yn-i : i & Rj) be indepen- 
dently distributed withyn-i ~ lEXF{ri]i.,(f>]i.i,T,Be,a), 



for i — 1,2. The log-likelihood ratio test statistic for 
testing Hq : rjj^-^ = rjji^ versus Hi : 7]b.i 7^ rjji^ is given 
by 

(6.4) A = k{Gr, , J + k{Gr,_ , ^r, ) - k(G, $) 

where K{x,y) = {xge{x) - Bf,{ge{x)))/y, Gr^ = 
T*{yn,),l/^R, = E,ei^,(l/a(</'fl,^)), = ^/^R. + 
l/$i?,2; bRi ^ ^i/^l[+i}^^^) and G = bR.GR, + (1 - 
bRi)GR2. 

Proof. The likelihood ratio is given by 
..P,,^,,, L(,.,,,H,;yK„y.,) _ g^^stituting the MLE 

expressions 77 and rjR^ from Theorem l6.1l and setting 



G = r*(yR,,yR, 



Ej = l,2 EiGH, l/a(0-Rj.) 



{l/'fi,, + l/<fn2 ) ^ iiffiii +l/*«2 ) -bR,GR, + {l- 

bRi)GR^, the result follows by computing logarithms. 

Fact 6.1. To test Hq : rjR-^ = riR^ versus Hi : tjr-^ > 
T]R^, the log-likelihood ratio test statistic is given by 
(6.5) 

A = l{r]k, > r]k,)iKiGR,,^R,)+K{GR„^R,)-K{G,^)) 

Similar result holds for the alternative Hi : rjR-^ < rjR^ 
with the inequalities reversed. 

In the above expression for A (with Ri — R,R2 — 
R'^), the key terms are the values bR and Gr. Gr — 
T*{yji) is a function of the data (T* is a sufficient 
statistic for the distribution), and thus is the equivalent 
of a measurement. In fact, Gr, is a weighted average 
of T{y,)s in R. Thus, Gr/^r = E^ei?T(2/0/a(<^^) 
represents the total in R. Similarly, G/$ gives the 



$H G 



aggregate for the region and hence mR 
the fraction of total contained in R. Also, 1/$_r 
gives the total area of R which is independent of the 
actual measurements and only depends on some baseline 
measure. Hence, bR — gives the fraction of total 
area in R. The next theorem provides an expression for 
A in terms of mR and bR. 

Theorem 6.3. Let Ri = R and i?2 = i?^- To obtain 
R € TZ that maximizes discrepancy, assume G and $ to 
be fixed and consider the parametrization of A in terms 
ofbR and mR = bRGR/G. 

The discrepancy measure (ignoring additive con- 
stants) d{.,.) is given by 

d[mR,bR}— = mRge[G- — ) - -p^Be(ge{G — )) -\- 

Lj bR u bR 

(6.6) (1 - mR)g,iG^-^) 

^ — bR 

^^S.(5e(G^-— )) 



Proof. Follows by substituting Gr = Gf^, Grc = 
G \^™^ in Ht).4|) . simplifying and ignoring additive con- 
stants. 



7 Discrepancy Measures For Specific 
Distributions 

We can now put together all the results from the pre- 
vious sections. Section ^ showed how to map a convex 
discrepancy function to a collection of linear discrepancy 
functions, and Section [S] presented algorithms maximiz- 
ing general linear discrepancy functions over axis paral- 
lel rectangles. The previous section presented a general 
formula for discrepancy in a one-parameter exponential 
family. We will now use all these results to derive dis- 
crepancy functions for specific distribution families and 
compute maximum discrepancy rectangles with respect 
to them. 

7.1 The Kulldorff Scan Statistic (Poisson dis- 
tribution) The Kulldorff scan statistic was designed 
for data generated by an underlying Poisson distribu- 
tion. We reproduce Kulldorff 's derivation of the likeli- 
hood ratio test, starting from our general discrepancy 
function A. 

In a Poisson distribution, underlying points are 
marked for the presence of some rare event (e.g. pres- 
ence of some rare disease in an individual) and hence the 
measurement attached to each point is binary with a 1 
indicating presence of the event. The number of points 
that get marked on a region R follows a Poisson process 
with base measure h and intensity A if (i) N{%) — 0, (ii) 
N{A) ^ Poisson(A6(yl)), ^ C R,b{-) is a baseline mea- 
sure defined on R and A is a fixed intensity parameter 
(examples of b{A) include the area of A, total number of 
points in A, etc.), and (iii) the nmubcr of marked points 
in disjoint subsets are independently distributed. 

Derivation of the Discrepancy Function. A 
random variable y Poisson(A/x) is a member of 
lEXP with Tiy) = y/fi,^ = 1/M,a(0) = 0, ry = 
log(A),Be(?7) = exp{r]) , ge{x) = \og{x). For a set 
of n independent measurements with mean Xfii , i — 

1,... ,n, r*(y) = Er=i2/«/E:uM^.'^* = (EiLi/^r^- 

For a subset R, assume the number of marked points 
follows a Poisson process with base measure &(•) 
and log-intensity rjR while that in R'^ has the same 
base measure but log- intensity Ty^c. For any par- 
tition {Ai} of R and {Bj} of 7?^ {N{A,)} and 
{N{Bj)} are independently distributed Poisson vari- 
ables with mean {exp(7/i^)6(yli)} and {eyip{riRc)b(Bj)} 
respectively. Then, 1/$^ = T,a,H^^)) = KR)^ 
= b{R^), Gr = = N{R)/biR), 

Gr. = N{R-)/b{R-), and G = ^.[gtS^? ■ Hence, 



b{R)fh{R^) and niR 



N{R) 

N{R)+N(R'') ■ 



dK{bR,mR) — 



"ii?,(log(G) + log( — )) ~bR—- 
bR bR 

(l-m«)(log(G)+log(l^)) 



1-bR 



1 - ruR 
1-bR 

= niR log( 



(1 - bR) 
^) + 

OR 



(1 - ruR) log{- — ^) -t- const 
1 - bR 

and hence dK{bR,mR) = c(mi^ log(^) + (1 — 
niR,) log( )); where c > is a fixed constant. Note 
that the discrepancy is independent of the partition used 
and hence is well defined. 

MsLximizing the Kulldorff Scan Statistic. It is 
easy to see that dpc is a convex function of ttir and bR, 
is always positive, and grows without bound as either 
of TTiR and bR tends to zero. It is zero when ttir — bR. 
The Kulldorff scan statistic can also be viewed as the 
KuUback-Leibler distance between the two two-point 
distributions [mR, 1 — ttlr] and [bR, 1 — 6^]. As usual, 
we will consider maximizing the Kulldorff scan statistic 
over the region Sn — [i-/n, 1 — To estimate the 

size of an e-approximate family for dx, we will compute 
A* over Sn- 

Let fKix,y) = xln^ + (l - x)ln^. 



K 



i In- 



1-x 



In 
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"1 


1-yJ 


\y 


1 


-yj 




i;(l-a;) 



y(i-y) 

: 1 — 



\ dydx 



The eigenvalues of H{f) are the roots of the equa- 
tion \H{f) - AI| = 0. Solving for A*, and substitut- 
ing from the expressions for the partial derivatives, and 
maximizing over Sn, we obtain A* = 0(n). 

Invoking Theorem 14.11 and Theorem 15.11 

Theorem 7.1. An additive e- approximation to the 
maximum discrepancy dpc over all rectangles contain- 
ing at least a constant measure can be computed in 
time 0(-|n^ log^ n). With respect to prospective time 
windows, the corresponding maximization takes time 
0(in3 log^n). 

The Jensen-Shannon divergence is a symmetrized 
variant of the KuUback-Leibler distance. We mentioned 
earlier that dK can be expressed as a KuUback-Leibler 



distance. Replacing this by the Jensen-Shannon dis- 
tance, we get a symmetric version of the Kulldorff statis- 
tic, for which all the bounds of Theorem 17.11 apply di- 
rectly. 

7.2 Gaussian Scan Statistic It is more natural to 
use an underlying Gaussian process when measurements 
are real numbers, instead of binary events. In this sec- 
tion, we derive a discrepancy function for an underlying 
Gaussian process. To the best of our knowledge, this 
derivation is novel. 

Derivation of the Discrepancy Function. A 
random variable y that follows a Gaussian distribu- 
tion with mean and variance (denoted as y 
N{^,1/t^) is a member of lEXP with T{y) — y^rj — 
H,Beir]) = rf/2,4) = 1/T2,a((/)) ^ (j},geix) = x. For a 
set of n independent measurements with mean fi and 
variances 1/Tf,i = 1, • • • , n(known), 0* — {J27=i 
and T*(y) = YTi=\Vi'^'i I YTi=\'^'i ■ Assume measure- 
ments in R are independent N{fj,ji, l/rf), {i G R) while 
those in R'^ are independent N{fijic,l/Tf),{i G R'^). 



R 



Then, $ 
T' 



G 



R" 



Hence, — 



eR'i I 



' ^R^ - (E 
2-, and G 



(i/*i?.+i/*fl<=) 
Thus, 



13 iGi?.-j-HC 

■ and ma — 



dG\OR,mR)— = mflG-T -ttG-t — 

G 0/; Gr Or 



(1 - mH)G 



1 - -rriR 
1 



1 - ^Rq] ~ "^fl 



5fl 



G 



1 



^^to| ^ (1 - ruRf ^ _ ^ ^^irriR - bRf 



and hence dcibR^mR) 



bRil-bR) ' 



bR{l- bR) 
where c > is a 



fixed constant. Note that the underlying basehne 6(-) is 
a weighted counting measure which aggregate weights 
rf attached to points in a region. 

MsLximizing the Gaussian Scan Statistic. 
Again, it can be shown that da is a convex function 
of both parameters, and grows without bound as bR 
tends to zero or one. Note that this expression can be 
viewed as the x^-distance between the two two-point 
distributions [mR, 1 — itir], [Br, 1 — Br]. The complexity 
of an e-approximate family for da can be analyzed as 
in Section [711 Let fG{x,y) = Expressions for 

V/g and H^fc) are presented in Appendix lA.il Solv- 
ing the equation \H — AI|, and maximizing over Sn, we 
get A* = 0(n2). 

Theorem 7.2. An additive e- approximation to the 
maximum discrepancy da over all rectangles contain- 



ing at least a constant measure can be computed in time 
O(i-n'^lognloglogn). With respect to prospective time 
windows, the corresponding maximization takes time 
O ( log n log log n) . 

Trading Error for Speed For the Kulldorff 
statistic, the function value grows slowly as it ap- 
proaches the boundaries of Sn- Thus, only minor im- 
provements can be made when considering relative error 
approximations. However, for the Gaussian scan statis- 
tic, one can do better. A simple substitution shows 
that when a; = 1 - i, y = i, fG{x,y) = Q{n). Us- 
ing this bound in Theorem 14.11 we see that a rela- 
tive e-approximatc family of size logn) can be con- 
structed for dc thus yielding the following result: 

Theorem 7.3. A 1/(1 -f-e) approximation to the maxi- 
mum discrepancy da over the space of axis parallel rect- 
angles containing constant measure can be computed in 
time 0{^n^ \og^ n). 

7.3 Bernoulli Scan Statistic Modeling a system 
with an underlying Bernoulli distribution is appropri- 
ate when the events are binary, but more common than 
those that would be modeled with a Poisson distribu- 
tion. For instance, a baseball player's batting average 
may describe a Bernoulli distribution of the expectation 
of a hit, assuming each at-bat is independent. 

Derivation of the Discrepancy Function. A 
binary measurment y at a point has a Bernoulli dis- 
tribution with parameter 9 ii P{y = 1) = 6^(1 — 
ef-y. This is a member of lEXP with T{y) = 
y,V^ log(r=e)>-Se(?7) = log(l -)-exp(?7)),(/. = l,a(0) = 
1, ge{x) = log(x) - log(l - x). 

For a set of n independent measurements with 
parameter rj, 0* l/n,r*(y) = E"=i2/i/"- As- 
suming measurements in R and i?"^ are independent 
Bernoulli with parameters tjr and Ty^c respectively, 
^R = 1/\R\,^R. = 1/\R%Gr = yiR)/\R\,GR. = 

y(U)/\U\,bR - + - \R\ + \R-\ '^R - 

viR^+yiR") • ^'^^^ t-l^^t y{A) denotes the number of I's 
in a subset A. Thus, 

cj) rriR 
dB{bR,mR)— = mfllog(— )-|- 

G Or 

(1 - tur) log(- — ) + {—- uir) log(l - G— ) 

i — Or Gr Ofl, 

+ (—Q 1 + mfl)log(l - G- 



l-br 



MsLximizing the Bernoulli Scan Statistic. 

Much like dK, it is easy to see that ds is a con- 
vex function of mR and bR, is always positive, and 
grows without bound as either bR or mR tend to zero 



or one. The complexity of an e-approximate fam- 
ily for ds, the Bernoulli scan statistic, can be ana- 
lyzed by letting fsix, y) = xlog | (1 - x) log \^ + 

(§-x)log(l-Gf)-f(i^-l + x)log(l-Gi3f), 
where G is a constant. The expressions for V/s and 
Hifs) are presented in Appendix lA. 21 Direct substitu- 
tion of the parameters yields A* ~ 0{n). 



time 0{-'n?\o^ n). With respect to prospective time 



windows, the 
0{\n^ log^n). 



References 



corresponding maximization takes time 



Theorem 7.4. An additive e- approximation to the 
maximum discrepancy ds over all rectangles contain- 
ing at least a constant measure can he computed in 
time O(in^log^n). With respect to prospective time 
windows, the corresponding maximization takes time 
0{\n^ log^ n). 

7.4 Gamma Scan Statistic When events arrive one 
after another, where a Poisson variable describes the 
interval between events, then a gamma distribution 
describes the count of events after a set time. 

Derivation of the Discrepancy Function. A 
positive measurement y has a gamma distribution with 
mean /l((> 0) and shape v{> 0) if it has density 
■jpY^exp{—^y)x'^^^ and is a member of lEXP with 
T{y) = y,r^ = -i(< 0),BeW = - log(-77), = 
l/v,a{(j)) = <j),ge{x) = — ^. Following arguments 
similar to the Gaussian case, $_r — {J2ieR ^R" 



and m_R 



. Hence, bf; — 



, Grc — 



(1/$H + 1/$H<=) 

Thus, 



5fl 



G(l-mfl) 



)- 



1 



,bR{l-mR) 

bR log( J- j—r) ~ l0g( 

^1^(1 - Or) 



6fllog(^) + (l-6fl)log( 
m_R 1 



G 

Ij-ruR. 
1-bR' 

l-bR 



log(G 



mR, 



l~bR 
const 

const 



mR 



and hence ignoring additive constants, d^{bR,mR) — 
c{hR\og{^) + (1 - 6«)log(^)),c(> 0) is a fixed 
constant. For a fixed shape parameter (i.e. Ui ^ u for 



[6; 

[s; 
[9; 
[lo; 

[11 

[12: 
[is; 

[14 



each i), hR = and ruR = Y~t^^^- 

MELximizing the Gamma Scan Statistic. Be- 
cause dry = dx up to an additive constant, f-y — fx and [15 
thus A* = 0{n) for H{f^). 

Theorem 7.5. An additive e- approximation to the 
maximum discrepancy d^ over all rectangles contain- 
ing at least a constant measure can be computed in 



[16 



CORMODE, G., KORN, F., MUTHUKRISHNAN, S., AND 

Srivastava, D. Diamond in the rough: finding 
hierarchical heavy hitters in multi-dimensional data. 
In SIGMOD (2004), pp. 155-166. 

CORMODE, G., AND MUTHUKRISHNAN, S. What's 

hot and what's not: tracking most frequent items 
dynamically. In PODS (2003), pp. 296-306. 
Cressie, N. Statistics for spatial data, 2nd ed. John 
Wiley, 1993. 

DOBKIN, D., AND Eppstein, D. Computing the 
discrepancy. Symposium on Computational Geometry 
9 (1993). 

DOBKIN, D. p., GUNOPULOS, D., AND MAASS, W. 

Computing the maximum bichromatic discrepancy, 
with applications to computer graphics and machine 
learning. NeuroGOLT Technical Report Series NG- 
TR-95-008 (March 1995). 

Friedman, J. H., and Fisher, N. I. Bump hunting 
in high-dimensional data. Statistics and Gomputing 9, 
2 (April 1999), 123-143. 

Good, P. Permutation tests - a practical guide to 
resampling methods for testing hypotheses. Springer- 
Verlag, 2nd edition. New York, 2000. 
Iyengar, V. On detecting space-time clusters. KDD 
(2004), 587-592. 

J.GlAZ, and BalAKRISHNAN, N. Scan statistics and 
applications. Birkhauser, Boston, 1999. 
J.Glaz, .J.Naus, and S.Wallenstein. Scan statis- 
tics. Springer- Verlag, New York, 2001. 
LiPSON, D., Aumann, Y., Ben-Dor, A., Linial, N., 
AND Yakhini, Z. Efficient calculation of interval scores 
for dna copy number data analysis. REGOMB (2005), 
83-100. 

M.DwASS. Modified randomization tests for nonpara- 
metric hypotheses. Annals of Mathematical Statistics 
28 (1957), 181-187. 

M.KuLLDORFF. A spatial scan statistic. Gommuni- 
cations in Statistics: Theory and Methods 26 (1997), 
1481-1496. 

M.KuLLDORFF. Prospective time-periodic geographi- 
cal disease surveillance using a scan statistic. Journal 
of the Royal statistical society, series A 164 (2001), 61- 
72. 

Neill, D. B., AND Moore, A. W. A fast multi- 
resolution method for detection of significant spatial 
disease clusters. Advances in Neural Information Pro- 
cessing Systems 10 (2004), 651-658. 
Neill, D. B., and Moore, A. W. Rapid detection 
of significant spatial clusters. In KDD (2004). 



A Gradients and Hessicins 

A.l Gaussian Scan Statistic Recall that 



yi}-y)J \yi^-y) y'^i^-yf 

( 2 -2 ■2{x-y){\-2y) \ 

^{JG) - \ _2 2(x-y){l-2y) 2 , 4{x-y){l-2y)-2(x-vf , {x-yf{l-2yf I 

\y{i-y) y^a-y)^ y(i-y) ?/=(i-3/)= / 

A. 2 Bernoulli Scan Statistic Recall that 

fix, y) = x log ^H^-x) log ^+ (I -x)\og[l- + (^i^ - 1 + ar) log (^1 - C 
where G is a constant. 



V/ =i (log - - log ^ + log (l - g]-^\ - log f 1 - G- 
1 , / „x\ 1 , / ^1 — X 



1-y 
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-"Uj— -1 1 -1 , 1 G(l-rr) Gx 



y-Gx (l-y)-G(l-x) y ' l-y (i_y)((i_y)_G(l-a;)) y{y-Gx) , 



